82 research outputs found

From Regular Expression Matching to Parsing

Authors: Bille Philip; Gørtz Inge Li
Publication date: 29/01/2019

Given a regular expression $R$ and a string $Q$, the regular expression parsing problem is to determine if $Q$ matches $R$ and, if so, determine how it matches, e.g., by a mapping of the characters of $Q$ to the characters in $R$. Regular expression parsing makes finding matches of a regular expression even more useful by allowing us to directly extract subpatterns of the match, e.g., for extracting IP addresses from internet traffic analysis or extracting subparts of genomes from genetic databases. We present new general techniques for efficiently converting a large class of algorithms that determine if a string $Q$ matches a regular expression $R$ into algorithms that can construct a corresponding mapping. As a consequence, we obtain the first efficient linear space solutions for regular expression parsing.
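To make the difference between matching and parsing concrete, here is a minimal sketch using Python's standard `re` module. It only illustrates the problem statement; the backtracking engine behind `re` is not the technique of the paper, and the pattern `IPV4` and the helper names are assumptions made for this example.

```python
import re

# Hypothetical pattern: four dot-separated decimal fields (no range check).
IPV4 = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")

def matches(q: str) -> bool:
    """Matching: only decide whether the whole string q matches."""
    return IPV4.fullmatch(q) is not None

def parse(q: str):
    """Parsing: also report which part of q each subpattern matched."""
    m = IPV4.fullmatch(q)
    if m is None:
        return None
    # span(i) gives the (start, end) offsets of group i inside q.
    return [(m.group(i), m.span(i)) for i in range(1, 5)]
```

For example, `parse("192.168.0.1")` returns each field together with its offsets in the input, which is the kind of mapping from characters of $Q$ to parts of $R$ that the abstract describes.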

arXiv.org e-Print Archive

Online Research Database In Technology

Space-Efficient Re-Pair Compression

Authors: Bille Philip; Gørtz Inge Li; Prezza Nicola
Publication date: 04/11/2016

Re-Pair is an effective grammar-based compression scheme achieving strong compression rates in practice. Let $n$, $\sigma$, and $d$ be the text length, alphabet size, and dictionary size of the final grammar, respectively. In the original Re-Pair paper, the authors show how to compute the Re-Pair grammar in expected linear time and $5n + 4\sigma^2 + 4d + \sqrt{n}$ words of working space on top of the text. In this work, we propose two algorithms improving on the space of their original solution. Our model assumes a memory word of $\lceil\log_2 n\rceil$ bits and a re-writable input text composed of $n$ such words. Our first algorithm runs in expected $\mathcal{O}(n/\epsilon)$ time and uses $(1+\epsilon)n + \sqrt{n}$ words of space on top of the text for any parameter $0 < \epsilon \leq 1$ chosen in advance. Our second algorithm runs in expected $\mathcal{O}(n \log n)$ time and improves the space to $n + \sqrt{n}$ words.
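For readers unfamiliar with Re-Pair, the following is a deliberately naive sketch of the scheme itself: repeatedly replace a most frequent adjacent pair with a fresh nonterminal. It rescans the sequence in every round, so it has neither the linear running time nor the space bounds discussed in the abstract; the function names and the `("N", k)` encoding of nonterminals are assumptions made for this illustration.

```python
from collections import Counter

def repair(text):
    """Naive Re-Pair: returns (compressed sequence, grammar rules)."""
    seq = list(text)
    rules = {}      # nonterminal -> (left symbol, right symbol)
    next_id = 0
    while True:
        # Counts adjacent pairs; overlapping occurrences (as in "aaa")
        # are counted too, which a careful implementation would avoid.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        sym = ("N", next_id)
        next_id += 1
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):     # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(seq, rules):
    """Decompress by expanding nonterminals with an explicit stack."""
    out, stack = [], list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)
```

Here `len(rules)` plays the role of the dictionary size $d$, and `expand(*repair(text))` reproduces the input text.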

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Immersive Algorithms: Better Visualization with Less Information

Authors: Bille Philip; Gørtz Inge Li
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017

Crossref

Online Research Database In Technology

Subsequence Automata with Default Transitions

Authors: Bille Philip; Gørtz Inge Li; Skjoldjensen Frederik Rye
Publication date: 01/01/2016

Let $S$ be a string of length $n$ with characters from an alphabet of size $\sigma$. The \emph{subsequence automaton} of $S$ (often called the \emph{directed acyclic subsequence graph}) is the minimal deterministic finite automaton accepting all subsequences of $S$. A straightforward construction shows that the size (number of states and transitions) of the subsequence automaton is $O(n\sigma)$ and that this bound is asymptotically optimal. In this paper, we consider subsequence automata with \emph{default transitions}, that is, special transitions to be taken only if none of the regular transitions match the current character, and which do not consume the current character. We show that with default transitions, much smaller subsequence automata are possible, and provide a full trade-off between the size of the automaton and the \emph{delay}, i.e., the maximum number of consecutive default transitions followed before consuming a character. Specifically, given any integer parameter $k$, $1 < k \leq \sigma$, we present a subsequence automaton with default transitions of size $O(nk\log_{k}\sigma)$ and delay $O(\log_k \sigma)$. Hence, with $k = 2$ we obtain an automaton of size $O(n \log \sigma)$ and delay $O(\log \sigma)$. On the other extreme, with $k = \sigma$, we obtain an automaton of size $O(n \sigma)$ and delay $O(1)$, thus matching the bound for the standard subsequence automaton construction. Finally, we generalize the result to multiple strings. The key component of our result is a novel hierarchical automata construction of independent interest.

Comment: Corrected typo
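The straightforward $O(n\sigma)$ construction mentioned in the abstract can be sketched as a next-occurrence table: state $i$ means that the first $i$ characters of $S$ have been passed, and a transition on $c$ jumps just past the next occurrence of $c$. This is only the standard baseline without default transitions (and without state minimization), not the hierarchical construction of the paper; the function names are assumptions made for this example.

```python
def build_subsequence_automaton(s):
    """delta[i][c] = state reached from state i on character c.

    States are 0..len(s); every state is accepting. Total number of
    transitions is at most len(s) * (alphabet size).
    """
    n = len(s)
    delta = [dict() for _ in range(n + 1)]
    nxt = {}  # nxt[c] = 1 + index of the next occurrence of c
    for i in range(n - 1, -1, -1):
        nxt[s[i]] = i + 1
        delta[i] = dict(nxt)   # copying costs O(sigma) per state
    return delta

def accepts(delta, q):
    """True if q is a subsequence of the string the automaton was built from."""
    state = 0
    for c in q:
        state = delta[state].get(c)
        if state is None:      # no regular transition: reject
            return False
    return True
```

Because each state stores up to $\sigma$ transitions, the table has $O(n\sigma)$ entries, which is the size that default transitions are meant to reduce.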

arXiv.org e-Print Archive

Online Research Database In Technology

Sparse Regular Expression Matching

Authors: Bille Philip; Gørtz Inge Li
Publication date: 10/07/2019

We present the first algorithm for regular expression matching that can take advantage of sparsity in the input instance. Our main result is a new algorithm that solves regular expression matching in $O\left(\Delta \log \log \frac{nm}{\Delta} + n + m\right)$ time, where $m$ is the number of positions in the regular expression, $n$ is the length of the string, and $\Delta$ is the \emph{density} of the instance, defined as the total number of active states in a simulation of the position automaton. This measure is a lower bound on the total number of active states in simulations of all classic polynomial sized finite automata. Our bound improves the best known bounds for regular expression matching by almost a linear factor in the density of the problem. The key component in the result is a novel linear space representation of the position automaton that supports state-set transition computation in near-linear time in the size of the input and output state sets.
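The notion of density can be illustrated with a plain state-set simulation of a position automaton. The sketch below assumes the automaton is given explicitly by its `first`, `follow`, and `last` sets (a hypothetical input format chosen for this example), and it reports the number of active positions after each character; whether the initial state is counted towards $\Delta$ is a convention not settled by the abstract. This is the textbook simulation, not the sparse algorithm of the paper.

```python
def simulate(first, follow, last, symbol, nullable, q):
    """State-set simulation of a position automaton on the string q.

    first, last: sets of positions; follow: position -> set of positions;
    symbol: position -> character; nullable: whether the empty string matches.
    Returns (accepted, sizes), where sizes[k] is the number of active
    positions after reading q[k].
    """
    active, at_start, sizes = set(), True, []
    for c in q:
        if at_start:
            candidates = first
        else:
            candidates = set().union(*(follow[p] for p in active))
        # Keep only the positions labeled with the character just read.
        active = {p for p in candidates if symbol[p] == c}
        at_start = False
        sizes.append(len(active))
    accepted = nullable if at_start else bool(active & last)
    return accepted, sizes

# Position automaton for (a|b)*ab with positions a=1, b=2, a=3, b=4.
SYMBOL = {1: "a", 2: "b", 3: "a", 4: "b"}
FIRST = {1, 2, 3}
FOLLOW = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
LAST = {4}
```

Under this convention, `sum(sizes)` is the density of the instance, which can be far smaller than the worst case $nm$ when few positions are active at a time.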

arXiv.org e-Print Archive

Random Access in Persistent Strings and Segment Selection

Authors: Bille Philip; Gørtz Inge Li
Publication date: 11/02/2021

We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a character) and a node represents the string obtained by applying the sequence of edit operations on the path from the root to the node. The goal is to compactly represent the entire collection while supporting fast random access to any part of a string in the collection. This problem captures natural scenarios such as representing the past history of an edited document or representing highly-repetitive collections. Given a tree with $n$ nodes, we show how to represent the corresponding collection in $O(n)$ space and $O(\log n / \log \log n)$ query time. This improves the previous time-space trade-offs for the problem. Additionally, we show a lower bound proving that the query time is optimal for any solution using near-linear space.

To achieve our bounds for random access in persistent strings we show how to reduce the problem to the following natural geometric selection problem on line segments. Consider a set of horizontal line segments in the plane. Given parameters $i$ and $j$, a segment selection query returns the $j$th smallest segment (the segment with the $j$th smallest $y$-coordinate) among the segments crossing the vertical line through $x$-coordinate $i$. The segment selection problem is to preprocess a set of horizontal line segments into a compact data structure that supports fast segment selection queries. We present a solution that uses $O(n)$ space and supports segment selection queries in $O(\log n / \log \log n)$ time, where $n$ is the number of segments. Furthermore, we prove that this query time is also optimal for any solution using near-linear space.

Comment: Extended abstract at ISAAC 202
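The segment selection query defined above can be pinned down with a brute-force baseline. The sketch below scans and sorts all segments on every query, so it takes $O(n \log n)$ time per query rather than the $O(\log n / \log \log n)$ of the paper; the tuple layout `(x1, x2, y)`, the closed-interval crossing test, and the function name are assumptions made for this example.

```python
def segment_select(segments, i, j):
    """Return the y-coordinate of the j-th smallest segment crossing x = i.

    segments: iterable of (x1, x2, y) with x1 <= x2, one horizontal segment each.
    j is 1-based; returns None if fewer than j segments cross the line.
    """
    # Collect the y-coordinates of all segments whose x-range contains i.
    crossing = sorted(y for (x1, x2, y) in segments if x1 <= i <= x2)
    if 1 <= j <= len(crossing):
        return crossing[j - 1]
    return None
```

A preprocessed data structure has to answer exactly these queries without touching every segment.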

arXiv.org e-Print Archive

Online Research Database In Technology

Fast Dynamic Arrays

Authors: Bille Philip; Christiansen Anders Roy; Ettienne Mikko Berggren; Gørtz Inge Li
Publication date: 01/01/2017

We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of $n$ elements supporting access in time $O(1)$ and insertion and deletion in time $O(n^\epsilon)$ for $\epsilon > 0$ while using $o(n)$ extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to $10^8$ elements. Our fastest implementation uses much less space than multiset while providing speedups of $40\times$ for access operations compared to multiset and speedups of $10{,}000\times$ compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.
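The idea behind a tiered vector can be sketched with two tiers: a list of fixed-capacity circular buffers in which every block except the last is kept full, so that index `p` always lives in block `p // B`. The sketch below is in Python rather than the C++ of the paper, supports only access and insertion, and uses a block size `B` fixed in advance; the class names and this simplification are assumptions made for illustration, not the authors' implementation.

```python
class Ring:
    """Fixed-capacity circular buffer."""
    def __init__(self, cap):
        self.buf = [None] * cap
        self.head = 0
        self.size = 0

    def get(self, i):
        return self.buf[(self.head + i) % len(self.buf)]

    def set(self, i, v):
        self.buf[(self.head + i) % len(self.buf)] = v

    def push_front(self, v):           # O(1); caller ensures not full
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = v
        self.size += 1

    def pop_back(self):                # O(1)
        self.size -= 1
        return self.get(self.size)

    def insert(self, i, v):            # O(cap) shift; caller ensures not full
        self.size += 1
        for k in range(self.size - 1, i, -1):
            self.set(k, self.get(k - 1))
        self.set(i, v)


class TieredVector:
    """Two-tier vector: all blocks are full except possibly the last."""
    def __init__(self, block_size):
        self.B = block_size
        self.blocks = []
        self.n = 0

    def __len__(self):
        return self.n

    def __getitem__(self, p):          # O(1): block index and offset
        if not 0 <= p < self.n:
            raise IndexError(p)
        return self.blocks[p // self.B].get(p % self.B)

    def insert(self, p, v):            # O(B + n/B)
        if not 0 <= p <= self.n:
            raise IndexError(p)
        b, off = divmod(p, self.B)
        if b == len(self.blocks):
            self.blocks.append(Ring(self.B))
        block = self.blocks[b]
        full = block.size == self.B
        carry = block.pop_back() if full else None
        block.insert(off, v)
        k = b + 1
        while full:                    # push the overflow through later blocks
            if k == len(self.blocks):
                self.blocks.append(Ring(self.B))
            nxt = self.blocks[k]
            full = nxt.size == self.B
            new_carry = nxt.pop_back() if full else None
            nxt.push_front(carry)
            carry = new_carry
            k += 1
        self.n += 1
```

With `B` close to $\sqrt{n}$, an insertion shifts at most `B` elements inside one block and then moves one element per later block, which corresponds to the $\epsilon = 1/2$ case of the bound in the abstract; more tiers lower the exponent further.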

arXiv.org e-Print Archive

Online Research Database In Technology

Distance labeling schemes for trees

Authors: Alstrup Stephen; Gørtz Inge Li; Halvorsen Esben Bistrup; Porat Ely
Publication date: 14/07/2015

We consider distance labeling schemes for trees: given a tree with $n$ nodes, label the nodes with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the distance in the tree between the two nodes. A lower bound by Gavoille et al. (J. Alg. 2004) and an upper bound by Peleg (J. Graph Theory 2000) establish that labels must use $\Theta(\log^2 n)$ bits (throughout, $\log$ denotes $\log_2$). Gavoille et al. (ESA 2001) show that for very small approximate stretch, labels use $\Theta(\log n \log \log n)$ bits. Several other papers investigate various variants such as, for example, small distances in trees (Alstrup et al., SODA'03). We improve the known upper and lower bounds of exact distance labeling by showing that $\frac{1}{4} \log^2 n$ bits are needed and that $\frac{1}{2} \log^2 n$ bits are sufficient. We also give $(1+\epsilon)$-stretch labeling schemes using $\Theta(\log n)$ bits for constant $\epsilon > 0$. $(1+\epsilon)$-stretch labeling schemes with polylogarithmic label size have previously been established for doubling dimension graphs by Talwar (STOC 2004). In addition, we present matching upper and lower bounds for distance labeling for caterpillars, showing that labels must have size $2\log n - \Theta(\log\log n)$. For simple paths with $k$ nodes and edge weights in $[1,n]$, we show that labels must have size $\frac{k-1}{k}\log n + \Theta(\log k)$.
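To illustrate what a distance labeling scheme is, here is a naive scheme for unweighted trees: each node's label is its root-to-node path, and the distance is recovered from two labels alone via their longest common prefix. Its labels can take on the order of $n \log n$ bits on deep trees, far from the $\Theta(\log^2 n)$ bounds in the abstract, so it is only a baseline; the `parent`-map input format and the function names are assumptions made for this example.

```python
def make_labels(parent):
    """parent maps each node to its parent; the root maps to None.

    Returns a dict: node -> tuple of node ids on the path root..node.
    Uses recursion, so very deep trees may hit Python's recursion limit.
    """
    labels = {}

    def label(v):
        if v not in labels:
            p = parent[v]
            labels[v] = (() if p is None else label(p)) + (v,)
        return labels[v]

    for v in parent:
        label(v)
    return labels

def distance(label_u, label_v):
    """Tree distance computed from the two labels only."""
    common = 0
    for a, b in zip(label_u, label_v):
        if a != b:
            break
        common += 1
    # Both paths share `common` nodes ending at the lowest common ancestor.
    return len(label_u) + len(label_v) - 2 * common
```

The decoder never looks at the tree itself, which is the defining property of a labeling scheme; the results in the abstract concern how short such labels can be made.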

arXiv.org e-Print Archive

Copenhagen University Research Information System

Dagstuhl Research Online Publication Server

Online Research Database In Technology